Data Science - Let’s get this started!
This is the link to my github repository: https://github.com/LonavanDelden/IODS-project.git
Here is the link to the diary page.
Starting the chapter 2 exercise by setting the working directory and reading in the data set created in the data wrangling exercise. The data set was saved as ‘learning2014_1.csv’.
The data are read in as ‘students2014’, containing 166 observations of 7 variables in 7 columns. The first column, ‘gender’, is the factor used to categorize the observations in the following statistical analysis.
setwd("C:/Users/lonav/Documents/IODS-project")
students2014 <- read.table("learning2014_1.csv", sep=",", header=T)
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
This paired plot overview shows the relationship of each variable to every other. First, more data are available from females than from males. The distribution curves show that most variables are distributed similarly for female and male students. The boxplots show the general distribution of the data and how widely they spread around the median, here highlighting ‘age’ with the most outliers: the observations are mostly of young people in their 20s and 30s. The scatter plots visualize the interaction of the variables and indicate any correlation present. Another way of looking at correlation is the correlation coefficient ‘Corr’, which summarizes the relationship of two variables in a single, easily comparable value ranging from -1 to 1: the correlation is strongest at -1 and 1, and a negative value simply means that one variable decreases when the other increases. Since this analysis mainly asks what can influence someone’s ‘points’, we focus on the correlation coefficients in that column. The strongest relationship is between ‘points’ and ‘attitude’, followed by ‘stra’ and ‘surf’.
library(GGally)
## Loading required package: ggplot2
library(ggplot2)
p <- ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
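As a self-contained sanity check on how the correlation coefficient behaves at its extremes (synthetic vectors, not the course data):

```r
# cor() ranges from -1 to 1: 1 for a perfect positive linear
# relationship, -1 for a perfect negative one
x <- 1:10
cor(x,  x)   # perfectly positive: 1
cor(x, -x)   # perfectly negative: -1
```

The pairwise ‘Corr’ values printed by ggpairs above are exactly these coefficients, computed for each variable pair.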
The variables ‘attitude’, ‘stra’, and ‘surf’ were chosen based on their correlation coefficients with the dependent variable ‘points’. The model summary gives descriptive statistics of the residuals (minimum, quartiles, median and maximum) as well as the standard errors and R-squared. The intercept is the predicted outcome when all explanatory variables equal zero. Each coefficient estimates the change in the dependent variable ‘points’ when the corresponding explanatory variable increases by one unit: when ‘attitude’ increases by one unit, ‘points’ increases by about 3.4; for one unit of ‘stra’, ‘points’ increases by about 0.9; and for one unit of ‘surf’, ‘points’ decreases by about 0.6. The p-value indicates whether such a relationship is statistically significant, i.e. how unlikely the estimate would be if there were no true effect. According to the model output, of the three variables only ‘attitude’ is highly significant.
library(ggplot2)
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
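The coefficients can be read as a prediction formula. A sketch using the estimates printed above with made-up predictor values, purely for illustration:

```r
# Predicted points for a hypothetical student with
# attitude = 4, stra = 3, surf = 2 (illustrative values)
pred <- 11.0171 + 3.3952 * 4 + 0.8531 * 3 - 0.5861 * 2
round(pred, 1)  # about 26 points
```

This is the same calculation that `predict(my_model, newdata = ...)` performs internally.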
Based on the outcome of the first model, ‘stra’ and ‘surf’ are now excluded. The estimated coefficient for ‘attitude’ changes a little and its standard error decreases slightly. The multiple R-squared of 0.19 indicates a weak (but highly significant) positive relationship: ‘attitude’ alone explains about 19% of the variation in ‘points’, so an increase in attitude predicts a modest increase in points.
library(ggplot2)
my_model2 <- lm(points ~ attitude, data = students2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The model assumes that the errors are normally distributed and that the relationship between the variables is linear. The following diagnostic plots can be used to check the validity of these assumptions. The ‘Residuals vs Fitted’ plot indicates that the relationship is linear, showing a mostly horizontal line, though the fit could likely be improved, e.g. with a larger sample size or additional variables in the model. The ‘Normal Q-Q’ plot mostly supports the normality assumption, with the points aligning along the line. The ‘Residuals vs Leverage’ plot shows a few outliers, but none that are extremely influential. Overall, the diagnostic plots indicate that the assumptions of the model are valid.
plot(my_model2, which=c(2,1,5))
Before knitting:
install.packages("GGally"); library(GGally); library(ggplot2)
Starting the chapter 3 exercise by setting the working directory, reading in the data set created in the data wrangling exercise, and printing the variable names. The data set was saved as ‘alc.csv’ and includes 382 observations of 35 variables. The observations are from Portuguese students, whose alcohol consumption will now be analysed to identify related factors. The variables are listed below in the column names, with emphasis on the computed alcohol use ‘alc_use’, which is the average of workday and weekend alcohol consumption, (Dalc + Walc)/2. If this value is greater than 2, the logical variable ‘high_use’ is TRUE. This high alcohol usage is the target variable for the exercise.
getwd()
## [1] "C:/Users/lonav/Documents/IODS-project"
setwd("C:/Users/lonav/Documents/IODS-project")
library(openxlsx)
alc <- read.csv("alc.csv", sep = ",", header = TRUE)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob <fct> teacher, other, other, services, other, other, othe...
## $ reason <fct> course, course, other, home, home, reputation, home...
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6...
## $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, ...
## $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, ...
## $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11,...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
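A minimal sketch of the wrangling step described above, using the real column names Dalc and Walc but made-up values:

```r
# alc_use is the average of workday (Dalc) and weekend (Walc)
# consumption; high_use flags averages above 2
d <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 3, 5))
d$alc_use  <- (d$Dalc + d$Walc) / 2
d$high_use <- d$alc_use > 2
d  # alc_use 1.0, 2.5, 4.5 -> high_use FALSE, TRUE, TRUE
```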
The variables failures, absences, sex and age are chosen as predictors. The hypotheses are that failures, absences and age increase with alcohol consumption, and, following popular consensus, that women show high alcohol usage less often than men. Findings: failures, absences and sex were significant predictors, while age showed no significant influence on high alcohol usage. Therefore the hypotheses that failures and absences increase with high alcohol usage are confirmed, but there is no support for an effect of age. The boxplots and cross tables highlight the lower alcohol usage by women compared to men, which confirms that hypothesis.
Numerical exploration with the logistic regression model:
library(GGally)
library(ggplot2)
m <- glm(high_use ~ failures + absences + sex + age, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex + age, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6644 -0.8051 -0.6026 1.0478 2.1115
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.82153 1.72779 -2.212 0.0270 *
## failures 0.37915 0.15333 2.473 0.0134 *
## absences 0.07059 0.01795 3.932 8.43e-05 ***
## sexM 0.98875 0.24475 4.040 5.35e-05 ***
## age 0.11375 0.10394 1.094 0.2738
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 417.44 on 377 degrees of freedom
## AIC: 427.44
##
## Number of Fisher Scoring iterations: 4
Graphical exploration:
library(ggplot2)
g <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")
g <- ggplot(alc, aes(x = high_use, y = age, col = sex))
g + geom_boxplot() + ylab("age") + ggtitle("Student age by alcohol consumption and sex")
select(alc, high_use, failures, absences, sex, age) %>% tail(10)
## high_use failures absences sex age
## 373 FALSE 1 0 M 19
## 374 TRUE 1 14 M 18
## 375 FALSE 0 2 F 18
## 376 FALSE 0 7 F 18
## 377 FALSE 1 0 F 19
## 378 FALSE 0 0 F 18
## 379 FALSE 1 0 F 18
## 380 FALSE 1 0 F 18
## 381 TRUE 0 3 M 17
## 382 TRUE 0 0 M 18
table(high_use = alc$high_use, failures = alc$failures)
## failures
## high_use 0 1 2 3
## FALSE 234 22 5 9
## TRUE 82 16 6 8
table(high_use = alc$high_use, absences = alc$absences)
## absences
## high_use 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## FALSE 94 2 54 2 36 4 19 6 13 3 12 1 6 0 7 1 2 0 2 0 1 1
## TRUE 22 1 13 4 15 0 11 2 8 0 5 1 4 3 4 2 4 1 2 1 1 0
## absences
## high_use 22 23 24 25 26 28 30 54 56 75
## FALSE 0 1 0 1 1 0 0 0 0 1
## TRUE 3 0 1 0 0 1 1 1 1 0
table(high_use = alc$high_use, sex = alc$sex)
## sex
## high_use F M
## FALSE 157 113
## TRUE 41 71
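The sex cross table can be converted into within-sex shares of high users (using the counts printed above):

```r
# Proportion of high_use within each sex, from the cross table
round(c(F = 41 / (157 + 41), M = 71 / (113 + 71)), 2)
# roughly 0.21 for females vs 0.39 for males
```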
table(high_use = alc$high_use, age = alc$age)
## age
## high_use 15 16 17 18 19 20 22
## FALSE 63 79 65 55 7 1 0
## TRUE 18 28 35 26 4 0 1
The model was adjusted by dropping the age variable because of its non-significant influence (see the non-significant ANOVA comparison at the bottom of the next output), which slightly increases the significance of the other variables. The remaining three variables are positively associated with high alcohol usage.
m1 <- glm(high_use ~ failures + absences + sex + age, data = alc, family = "binomial")
summary(m1)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex + age, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6644 -0.8051 -0.6026 1.0478 2.1115
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.82153 1.72779 -2.212 0.0270 *
## failures 0.37915 0.15333 2.473 0.0134 *
## absences 0.07059 0.01795 3.932 8.43e-05 ***
## sexM 0.98875 0.24475 4.040 5.35e-05 ***
## age 0.11375 0.10394 1.094 0.2738
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 417.44 on 377 degrees of freedom
## AIC: 427.44
##
## Number of Fisher Scoring iterations: 4
m2 <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6629 -0.8545 -0.5894 1.0339 2.0428
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.95397 0.22819 -8.563 < 2e-16 ***
## failures 0.40462 0.15024 2.693 0.00708 **
## absences 0.07294 0.01796 4.061 4.88e-05 ***
## sexM 0.98848 0.24453 4.042 5.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 418.64 on 378 degrees of freedom
## AIC: 426.64
##
## Number of Fisher Scoring iterations: 4
anova(m1, m2, test="LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ failures + absences + sex + age
## Model 2: high_use ~ failures + absences + sex
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 377 417.44
## 2 378 418.64 -1 -1.2029 0.2727
Odds ratios and their confidence intervals are explored:
Male students have about 2.7 times the odds of high alcohol usage compared to female students; each additional class failure multiplies the odds by about 1.5, and each additional absence by about 1.1. This makes sex the strongest single predictor of high alcohol usage. The confidence interval gives the uncertainty of each estimate: we are 95% confident that the true odds ratio for male students lies between roughly 1.7 and 4.4.
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1417107 0.08883883 0.2178283
## failures 1.4987270 1.11549818 2.0187171
## absences 1.0756623 1.04072883 1.1163576
## sexM 2.6871365 1.67434331 4.3755694
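The odds ratios above are simply the exponentiated model coefficients. For example, for sexM, using the estimate from the summary of m2 above:

```r
# log-odds coefficient for sexM -> odds ratio
b_sexM  <- 0.98848
or_sexM <- exp(b_sexM)
round(or_sexM, 2)  # about 2.69, matching the OR column
```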
Predictions computed and visualized by a 2x2 cross tabulation: the predictions were correct 284 times (FALSE predicted as FALSE, or TRUE as TRUE) and incorrect 98 times (FALSE predicted as TRUE, or vice versa), i.e. about a quarter of the predictions did not match the observations.
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
## failures absences sex high_use probability prediction
## 373 1 0 M FALSE 0.3633449 FALSE
## 374 1 14 M TRUE 0.6130701 TRUE
## 375 0 2 F FALSE 0.1408685 FALSE
## 376 0 7 F FALSE 0.1910175 FALSE
## 377 1 0 F FALSE 0.1751799 FALSE
## 378 0 0 F FALSE 0.1241213 FALSE
## 379 1 0 F FALSE 0.1751799 FALSE
## 380 1 0 F FALSE 0.1751799 FALSE
## 381 0 3 M TRUE 0.3215447 FALSE
## 382 0 0 M TRUE 0.2757800 FALSE
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 258 12
## TRUE 86 26
library(dplyr)
library(ggplot2)
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67539267 0.03141361 0.70680628
## TRUE 0.22513089 0.06806283 0.29319372
## Sum 0.90052356 0.09947644 1.00000000
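From the counts in the cross tabulation above, the total error rate can be computed by hand:

```r
# correctly vs incorrectly classified observations
correct <- 258 + 26   # FALSE->FALSE and TRUE->TRUE
wrong   <- 12 + 86    # off-diagonal cells
wrong / (correct + wrong)  # about 0.257
```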
The training error & cross-validation:
The training error indicates the average proportion of wrong predictions, about 26%. After 10-fold cross-validation the model error stays at roughly the same level, about 26%. Depending on the scientific field (I am from environmental science) this is quite high and could indicate missing variables with high influence. As for my hypotheses, the model clearly proved me wrong about age being an influential variable in this data set. Age might become a significant factor if the observations spanned a whole lifetime (a bigger data set); I would still guess that students in their 40s are less likely to show high alcohol consumption, but there are no observations on this in the data. The other three hypotheses were confirmed, so sometimes statistics even support logical thinking :)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2565445
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
cv$delta[1]
## [1] 0.2591623
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2591623
Before knitting: library(GGally); library(ggplot2)
Read in the data and explore it. The data set covers housing values in the suburbs of Boston, USA. The variables include crime statistics and information on residence types and values, such as access to infrastructure, schooling etc. Please find the summary below.
library(MASS)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The graphical overview shows that the variables are not normally distributed (except perhaps ‘rm’, rooms per dwelling).
The correlation matrix highlights the strongest positive correlation between accessibility to radial highways (‘rad’) and the full-value property-tax rate (‘tax’), meaning property taxes increase substantially close to long-distance connections, which are crucial for many industries.
The strongest negative correlations are between the weighted mean distance to five Boston employment centres (‘dis’) and building age (‘age’), nitrogen oxide concentration (‘nox’) and the proportion of non-retail business acres per town (‘indus’). As the distance to the employment centres increases, building age, nox levels and the share of non-retail business decrease, which makes sense: new buildings are more common on the outskirts of town (easier to build new than to renovate), there is less pollution from traffic, and businesses prefer the crowded centre. Another strong negative correlation is between the share of lower-status population (‘lstat’) and the median value of owner-occupied homes (‘medv’), which is unsurprising.
library(corrplot)
library(tidyverse)
library(GGally)
ggpairs(Boston, upper = list(continuous = wrap("cor", size=2.6)))
cor_matrix <- cor(Boston) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The data set is standardized by scaling the variables with their mean and standard deviation. Please see the summary below: the mean is now 0 for every variable, and the values are expressed in standard deviations from the mean rather than in the original units.
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
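What scale() does can be seen on a tiny synthetic vector: it subtracts the mean and divides by the standard deviation.

```r
x <- c(2, 4, 6)              # mean 4, sd 2
s <- as.numeric(scale(x))
s                            # -1 0 1
```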
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
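The quantile-based binning used for the crime classes can be checked on a toy vector (synthetic values, for illustration only):

```r
# quantile() gives min, quartiles and max as break points;
# cut() then assigns each value to one of the four bins
x <- 1:8
b <- quantile(x)
crime_demo <- cut(x, breaks = b, include.lowest = TRUE,
                  labels = c("low", "med_low", "med_high", "high"))
table(crime_demo)  # two observations per bin
```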
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
train_cluster <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ zn : num -0.487 -0.487 0.585 -0.487 -0.487 ...
## $ indus : num -1.033 1.015 -0.915 -0.437 -1.033 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.386 1.366 -1.111 -0.144 -0.386 ...
## $ rm : num -0.6058 -0.4962 0.4759 -0.4193 0.0432 ...
## $ age : num 0.00444 0.41654 0.06484 0.46627 0.17141 ...
## $ dis : num -0.519 -0.482 0.763 0.22 -0.227 ...
## $ rad : num -0.522 1.66 -0.637 -0.637 -0.522 ...
## $ tax : num -0.666 1.529 -0.755 -0.601 -0.666 ...
## $ ptratio: num -0.857 0.806 0.251 1.175 -0.857 ...
## $ black : num 0.4 -3.868 0.427 0.329 0.426 ...
## $ lstat : num -0.422 0.6 -0.761 0.282 -0.891 ...
## $ medv : num 0.00731 -0.98214 0.14865 -0.54722 0.22477 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 2 4 1 3 1 4 3 4 2 2 ...
str(test)
## 'data.frame': 102 obs. of 13 variables:
## $ zn : num -0.4872 0.0487 0.0487 0.0487 -0.4872 ...
## $ indus : num -1.306 -0.476 -0.476 -0.476 -0.437 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.834 -0.265 -0.265 -0.265 -0.144 ...
## $ rm : num 0.207 -0.93 -0.399 0.131 -0.794 ...
## $ age : num -0.3508 1.1164 0.6155 0.9139 0.0329 ...
## $ dis : num 1.076671 1.086122 1.32832 1.21178 0.000692 ...
## $ rad : num -0.752 -0.522 -0.522 -0.522 -0.637 ...
## $ tax : num -1.105 -0.577 -0.577 -0.577 -0.601 ...
## $ ptratio: num 0.113 -1.504 -1.504 -1.504 1.175 ...
## $ black : num 0.41 0.328 0.329 0.393 0.375 ...
## $ lstat : num -1.042 2.419 0.623 1.092 -0.192 ...
## $ medv : num 0.671 -0.656 -0.395 -0.819 -0.471 ...
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2400990 0.2599010 0.2475248
##
## Group means:
## zn indus chas nox rm
## low 1.04540726 -0.9410872 -0.11793298 -0.9061276 0.4435558
## med_low -0.08764296 -0.3412519 -0.02879709 -0.5679221 -0.1006331
## med_high -0.37126774 0.1551471 0.06513667 0.3424809 0.0583992
## high -0.48724019 1.0149946 -0.07547406 1.0259521 -0.4423951
## age dis rad tax ptratio
## low -0.9345413 0.9426173 -0.7015101 -0.7000372 -0.45178163
## med_low -0.3869946 0.3298961 -0.5390602 -0.4495326 -0.06946085
## med_high 0.3771904 -0.3495764 -0.4218568 -0.3333956 -0.30631378
## high 0.7767667 -0.8208859 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.3777427 -0.775359064 0.51812319
## med_low 0.3242150 -0.258830003 0.03790726
## med_high 0.1185324 -0.002069374 0.18375902
## high -0.8804332 0.892015389 -0.68215019
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.06592796 0.671249335 -0.96119252
## indus 0.06475804 -0.437731476 0.10005789
## chas -0.07224804 -0.019911440 0.12836880
## nox 0.26114391 -0.721492002 -1.32265542
## rm -0.09130611 -0.015240272 -0.16547723
## age 0.28066247 -0.342963819 -0.16336714
## dis -0.03010664 -0.296490143 -0.04543333
## rad 3.44946645 0.638971092 -0.13435656
## tax -0.07066018 0.431781950 0.68727033
## ptratio 0.14090790 0.049024527 -0.23616799
## black -0.16500111 0.004877811 0.11536911
## lstat 0.22070590 -0.258963489 0.06561186
## medv 0.19033962 -0.419267976 -0.31683839
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9519 0.0385 0.0097
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1.4)
Compared to the correct test classes, the LDA model predicts about a quarter of the cases wrong (74 classes correct, as in low predicted as low etc., and 28 classes wrong, as in low predicted as med_low etc.).
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 10 13 2 0
## med_low 3 20 6 0
## med_high 0 2 18 1
## high 0 0 1 26
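Using the counts in the confusion table above, the share of misclassified test cases is:

```r
correct <- 10 + 20 + 18 + 26  # diagonal: class predicted correctly
wrong   <- 102 - correct      # everything off the diagonal
wrong / 102                   # about 0.27
```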
Working again on the original data set ‘Boston’, which is scaled anew. The data are prepared, explored and clustered by k-means. The plot of the total within-cluster sum of squares shows its sharpest drop at two clusters, so the number of clusters for the visualization was set to two. The pairs matrix plots every variable against every other, so the upper and lower triangles mirror each other.
library(MASS)
data('Boston')
boston_scaled1 <- scale(Boston)
boston_scaled1 <- as.data.frame(boston_scaled1)
dist_eu <- dist(boston_scaled1)
dist_man <- dist(boston_scaled1, method = "manhattan")
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled1, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
km <-kmeans(boston_scaled1, centers = 2)
pairs(boston_scaled1, col = km$cluster)
To see the variables more clearly, the plot was cut into smaller pieces of at most five variables. Not all variable pairs show clear clustering, but e.g. the proportion of non-retail business acres (‘indus’), nitrogen oxides (‘nox’), building age (‘age’) and median home value (‘medv’) separate quite well into clusters when plotted against other variables. Especially nitrogen oxides show clear cluster separation around the mean when paired with ‘age’, ‘rm’, ‘dis’ etc. To clarify: the data are scaled, so in this case the clusters separate the data in relation to the mean.
library(MASS)
km <-kmeans(boston_scaled1, centers = 2)
pairs(boston_scaled1[1:5], col = km$cluster)
pairs(boston_scaled1[5:10], col = km$cluster)
pairs(boston_scaled1[10:14], col = km$cluster)
K-means is performed on the original Boston data, standardized by scaling, with three clusters, and LDA is then fitted with the clusters as target classes. The proportion of residential land zoned for lots over 25,000 sq.ft. (‘zn’) is the most influential variable.
library(MASS)
library(ggplot2)
data('Boston')
boston_scaled2 <- scale(Boston)
km <-kmeans(boston_scaled2, centers = 3)
cluster <- km$cluster
boston_scaled2 <- data.frame(boston_scaled2, cluster)
lda.fit2 <- lda(cluster ~ ., data = boston_scaled2)
lda.fit2
## Call:
## lda(cluster ~ ., data = boston_scaled2)
##
## Prior probabilities of groups:
## 1 2 3
## 0.2470356 0.4268775 0.3260870
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3989700 1.2614609 -0.9791535 -0.020354653 -0.8573235 1.0090468
## 2 -0.3788713 -0.3578148 -0.2879024 0.001080671 -0.3709704 -0.2328004
## 3 0.7982270 -0.4872402 1.1186734 0.014005495 1.1351215 -0.4596725
## age dis rad tax ptratio black
## 1 -0.96130713 0.9497716 -0.5867985 -0.6709807 -0.80239137 0.3552363
## 2 -0.05427143 0.1034286 -0.5857564 -0.5951053 0.01241316 0.2805140
## 3 0.79930921 -0.8549214 1.2113527 1.2873657 0.59162230 -0.6363367
## lstat medv
## 1 -0.9571271 1.06668290
## 2 -0.1047617 -0.09820229
## 3 0.8622388 -0.67953738
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim -0.03206338 0.19094456
## zn 0.02935900 1.07677218
## indus 0.63347352 0.09917524
## chas 0.02460719 -0.10009606
## nox 1.11749317 0.75995105
## rm -0.18841682 0.57360135
## age -0.12983139 -0.47226685
## dis 0.04493809 0.34585958
## rad 0.67004295 0.08584353
## tax 1.03992455 0.58075025
## ptratio 0.25864960 0.02605279
## black -0.01657236 -0.01975686
## lstat 0.17365575 0.41704235
## medv -0.06819126 0.79098605
##
## Proportion of trace:
## LD1 LD2
## 0.8506 0.1494
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes2 <- as.numeric(boston_scaled2$cluster)
plot(lda.fit2, dimen = 2, col = classes2, pch = classes2, main = "LDA biplot using three clusters 1, 2 and 3")
lda.arrows(lda.fit2, myscale = 1.4)
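Which original variables drive each discriminant can also be read straight off the coefficient matrix, rather than only from the arrow lengths in the biplot. A sketch using lda on the built-in iris data (a stand-in; note that on unscaled data the coefficients are not strictly comparable across variables, which is one reason the Boston data were standardized first):

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
# one column of coefficients per linear discriminant
ld1 <- fit$scaling[, "LD1"]
# rank the predictors by the absolute size of their LD1 coefficient
sort(abs(ld1), decreasing = TRUE)
```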
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
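The summary above confirms that every column mean is zero after scale(); each column also ends up with unit standard deviation. A quick sketch verifying both properties on simulated data:

```r
set.seed(7)
m <- matrix(rnorm(200, mean = 10, sd = 3), ncol = 4)
# scale() subtracts each column mean and divides by each column sd
m_scaled <- scale(m)
round(colMeans(m_scaled), 10)   # numerically zero
apply(m_scaled, 2, sd)          # exactly one
```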
lda.fit <- lda(crime ~., data = train)
model_predictors <- dplyr::select(train, -crime)
dim(train_cluster)
## [1] 404 14
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling # if %*% fails, check the dimensions: the number of columns of the first matrix must match the number of rows of the second
matrix_product <- as.data.frame(matrix_product)
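The dimension rule in the comment above is just matrix conformability: an n×m matrix can only be multiplied with an m×p matrix, and the result is n×p. A minimal illustration:

```r
A <- matrix(1:6, nrow = 2)  # 2 x 3
B <- matrix(1:6, nrow = 3)  # 3 x 2
dim(A %*% B)                # inner dimensions match (3 = 3), result is 2 x 2
# A %*% t(B) would fail: t(B) is 2 x 3, and 3 != 2
res <- try(A %*% t(B), silent = TRUE)
inherits(res, "try-error")
```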
library(plotly)
p1 <- plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
p1
p2 <- plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train_cluster$crime)
p2
train2 <- dplyr::select(train, -crime)
km3 <-kmeans(train2, centers = 3)
p3 <- plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km3$cluster)
p3
Before knitting: library(GGally), library(ggplot2)
The data are from the United Nations Development Programme and combine various variables to assess the Human Development Index (HDI) with more than just economic growth.
Here are my header explanations:
f.sec_edu = proportion of females with at least secondary education; m.sec_edu = proportion of males with at least secondary education; f.labour = proportion of females in the labour force; m.labour = proportion of males in the labour force
The summary shows that the observations spread across very different scales, which suggests applying standardization later on so they can be compared in further analysis. The pairs plots show that some variables have normally distributed observations while others (e.g. GNI) do not. The strongest (negative) relationship is between the maternal mortality ratio (mat.mor_rat) and life expectancy at birth (life): quite logically, when the mortality ratio increases, life expectancy decreases. Strong positive relationships exist between the mortality ratio and the birth rate and, a little less obviously, between life expectancy and expected years of schooling (exp_edu). Overall, education seems to have a stronger influence on life expectancy, birth rate, mortality ratio etc. than labour-related factors.
library(openxlsx)
library(tidyr)
library(dplyr)
library(GGally)
library(corrplot)
library(ggfortify)
human <- read.csv("human.csv", sep = ",", header = TRUE)
dim(human)
## [1] 155 8
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ f_m.sec_edu: num 1.007 0.997 0.983 0.989 0.969 ...
## $ f_m.labour : num 0.891 0.819 0.825 0.884 0.829 ...
## $ life : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ exp_edu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ mat.mor_rat: int 4 6 6 5 6 7 9 28 11 8 ...
## $ birth.rate : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ par.repr : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
colnames(human)
## [1] "f_m.sec_edu" "f_m.labour" "life" "exp_edu" "GNI"
## [6] "mat.mor_rat" "birth.rate" "par.repr"
summary(human)
## f_m.sec_edu f_m.labour life exp_edu
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI mat.mor_rat birth.rate par.repr
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
ggpairs(human)
cor(human) %>% corrplot(type = "upper")
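The strongest pairwise relationships read off the plots above can also be ranked numerically. A sketch on simulated data (the column names below are stand-ins; with the real data you would pass cor(human) in the same way):

```r
set.seed(42)
# toy data with one strong negative relationship built in
d <- data.frame(life = rnorm(100))
d$mat.mor_rat <- -0.9 * d$life + rnorm(100, sd = 0.3)
d$noise <- rnorm(100)

cm <- cor(d)
# keep each pair once (upper triangle), then sort by absolute correlation
cm[lower.tri(cm, diag = TRUE)] <- NA
pairs_df <- na.omit(as.data.frame(as.table(cm)))
pairs_df[order(-abs(pairs_df$Freq)), ]
```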
The difference between PCA on standardized and unstandardized data lies in the spread of the plotted observations: in the unstandardized data, PC1 becomes effectively the sole explanatory component.
pca_human <- prcomp(human)
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
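The dominance of PC1 above is a scale artifact: prcomp maximizes variance, so an unstandardized variable on a large scale (here GNI) swallows the first component. A sketch demonstrating the effect on simulated data:

```r
set.seed(1)
d <- data.frame(a = rnorm(100),         # unit scale
                b = rnorm(100) * 1000)  # large scale, like GNI
# without scaling, PC1 is essentially just 'b'
p_raw <- summary(prcomp(d))$importance["Proportion of Variance", ]
# with scaling, both variables contribute roughly equally
p_std <- summary(prcomp(d, scale. = TRUE))$importance["Proportion of Variance", ]
round(p_raw, 3)
round(p_std, 3)
```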
The PCA on the standardized data separates more individual components, which reduces the PC1 effect to about 54%, still the majority. Now the most influential original variables can be identified: they are mostly related to secondary education, together with the birth rate, maternal mortality ratio, life expectancy and gross national income (their variable arrows lie along the PC1 axis). Additionally, the variables group around education, life expectancy and GNI, which means they are positively correlated with each other, while negatively correlated with the birth rate and maternal mortality ratio.
PC1 example: Highly educated + low maternal mortality ratio -> developed countries (Europe, Scandinavia); less educated + high maternal mortality ratio -> developing countries (Africa)
PC2 is the second most influential component but, at 16%, considerably weaker. It represents the situation of women in the work force by combining women in parliament and the ratio of employed women to men (their variable arrows lie along the PC2 axis). These two variables are at close to a 90 degree angle to the other variables, which means they are not likely to be correlated with them.
PC2 example: High ratio of women to men with work + women in high ranking jobs -> many women in the work force (Africa, Scandinavia); low ratio of women to men with work + less women in high ranking jobs -> less women in the work force (Middle East, Asia)
The PCA on these data highlights that the Human Development Index is mainly determined by well-known country development indicators such as gross national income, birth rate and secondary education. It also shows that women in the work force are a valuable component for distinguishing human development: e.g. some African countries do not do well in the general development comparison but distinguish themselves clearly by their rate of women in the work force and in high-ranking political jobs. This may give a more positive picture of equality developments in these countries, as women in parliament are more likely to work towards improvements in education and mortality rates as well.
human_std <- scale(human)
summary(human_std)
## f_m.sec_edu f_m.labour life exp_edu
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI mat.mor_rat birth.rate par.repr
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca_human <- prcomp(human_std)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = "Country Development (53.6%)", ylab = "Women in the work force (16.2%)")
Caption: GNI = Gross National Income per capita, life = Life expectancy at birth, exp_edu = Expected years of schooling, mat.mor_rat = Maternal mortality ratio, birth.rate = Adolescent birth rate, par.repr = Percentage of female representatives in parliament, f_m.sec_edu = Ratio of females to males with at least secondary education, f_m.labour = Ratio of females to males in the labour force
The tea data are quite complex, with 36 variables of qualitative measures. Further analysis is done with a subset of 6 variables. The distribution of the observations per variable is shown in the bar plots below and highlights that the variables differ considerably from each other, with 2 to 4 categories each.
library(FactoMineR)
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
gather(tea_time) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
While the PCA clearly identified the most important component by grouping the correlated influential variables, this MCA shows a broader distribution of explained variance over the dimensions, ranging from about 5% to 15% (scree plot). This means the first two dimensions do not differ from each other in significance as much as in the previous data set.
So the biplot only explains about 30% of the variance in the data, which means more than two thirds of the variance is not shown and some insight might be missed. The individuals (observations) are represented as blue dots and the variable categories as red triangles. The distance between any two observations or categories gives a measure of their similarity (or dissimilarity): observations with similar profiles are close together on the factor map. Despite the low percentage of explained variance, the biplot shows some patterns, but the correlation of the variables with their dimensions indicates only weak relationships. Including more variables might improve the outcome, but so far I would say that tea habits are fairly randomly distributed and only weakly related to each other.
library(factoextra)
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 45))
fviz_mca_biplot(mca, ggtheme = theme_minimal())
plot(mca, invisible=c("ind"), habillage = "quali")
mca_pr <- round(mca$eig[, "percentage of variance"], digits = 1)
mca_pr
## dim 1 dim 2 dim 3 dim 4 dim 5 dim 6 dim 7 dim 8 dim 9 dim 10 dim 11
## 15.2 14.2 12.0 10.3 9.7 8.5 7.8 7.7 6.4 4.7 3.4
mc_lab <- paste0(names(mca_pr), " (", mca_pr, "%)")
fviz_mca_var(mca, choice = "mca.cor", ggtheme = theme_minimal())
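The "~30% in two dimensions" figure above comes from the cumulative eigenvalue percentages; how many dimensions a target coverage needs can be read off the same table. A sketch using the percentages printed in the MCA summary (hard-coded here for illustration; with the fitted object you would use mca$eig[, 2]):

```r
# % of variance per dimension, copied from summary(mca)
pct <- c(15.238, 14.232, 11.964, 10.333, 9.667, 8.519,
         7.841, 7.705, 6.392, 4.724, 3.385)
cum_pct <- cumsum(pct)
# first number of dimensions reaching at least 70% of the variance
which(cum_pct >= 70)[1]
```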
library(dplyr)
library(tidyr)
library(ggplot2)
BPRS <- read.csv("BPRS.csv", sep = ",", header = TRUE)
glimpse(BPRS)
## Observations: 360
## Variables: 5
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks <fct> week0, week0, week0, week0, week0, week0, week0, wee...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
RATS <- read.csv("RATS.csv", sep = ",", header = TRUE)
glimpse(RATS)
## Observations: 176
## Variables: 5
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Group <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1...
## $ WD <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, ...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, ...
## $ time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8...
## # A tibble: 6 x 6
## ID Group WD Weight time stdWeight
## <fct> <fct> <fct> <int> <int> <dbl>
## 1 1 1 WD1 240 1 -1.73
## 2 2 1 WD1 225 1 -2.83
## 3 3 1 WD1 245 1 -1.37
## 4 4 1 WD1 260 1 -0.271
## 5 5 1 WD1 255 1 -0.637
## 6 6 1 WD1 260 1 -0.271
n <- RATS$time %>% unique() %>% length()
# note: this n is the number of time points; strictly, the standard error
# of each group mean should use the number of rats in that group
RATSS <- RATS %>%
group_by(Group, time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
ungroup()
glimpse(RATSS)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, ...
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 26...
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.5529...
ggplot(RATSS, aes(x = time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=2) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
theme(legend.position = "right") +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
RATS8S <- RATS %>%
filter(time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
glimpse(RATS8S)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, ...
# Plot boxplot
ggplot(RATS8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight) without WD1")
# Filter the outliers
RATS8S1 <- RATS8S %>%
filter(
(mean > 250 & Group == 1) |
(mean < 550 & Group == 2) |
(mean > 500 & Group == 3)
)
# Plot again
glimpse(RATS8S1)
## Observations: 13
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3
## $ ID <fct> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16
## $ mean <dbl> 263.2, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, ...
ggplot(RATS8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight) without WD1")
# 'WD1' exists only in the wide-format data; with this long format the
# baseline (time 1) weights have to be extracted explicitly:
RATS8S2 <- RATS8S %>%
mutate(baseline = filter(RATS, time == 1)$Weight)
#fit <- lm(mean ~ baseline + Group, data = RATS8S2)
#anova(fit)
library(lme4)
library(lmerTest)
library(afex)
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject) + (week * treatment), data = BPRS, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood . t-tests use
## Satterthwaite's method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 45.8817 2.4413 92.0743 18.794 <2e-16 ***
## week -2.2704 0.2084 340.0000 -10.896 <2e-16 ***
## treatment 0.5722 1.0761 340.0000 0.532 0.595
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.341
## treatment -0.661 0.000
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood . t-tests use
## Satterthwaite's method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7977
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9803 -0.51
## Residual 97.4304 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 45.8817 2.5685 49.2111 17.863 < 2e-16 ***
## week -2.2704 0.2977 19.9985 -7.626 2.42e-07 ***
## treatment 0.5722 1.0405 320.0007 0.550 0.583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.477
## treatment -0.608 0.000
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood . t-tests use
## Satterthwaite's method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject) + (week * treatment)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0513 -0.6270 -0.0768 0.5287 3.9259
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 65.0076 8.0627
## week 0.9686 0.9842 -0.51
## Residual 96.4708 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 50.1767 3.5159 141.9981 14.271 < 2e-16 ***
## week -3.3442 0.6711 253.0304 -4.983 1.16e-06 ***
## treatment -2.2911 1.9090 319.9966 -1.200 0.2310
## week:treatment 0.7158 0.4010 319.9966 1.785 0.0752 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) week trtmnt
## week -0.768
## treatment -0.814 0.753
## week:trtmnt 0.684 -0.896 -0.840
anova(BPRS_ref1, BPRS_ref)
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRS
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject) + (week * treatment)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
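The chi-square statistic and p-value in the first anova() call can be reproduced from the reported deviances: the likelihood ratio statistic is the deviance difference, tested against a chi-square distribution with df equal to the difference in parameter count. A sketch using the numbers printed above:

```r
# deviances and parameter counts from the anova(BPRS_ref1, BPRS_ref) table
dev_ref  <- 2738.7   # random intercept model, 5 parameters
dev_ref1 <- 2731.4   # random intercept + slope model, 7 parameters
chisq <- dev_ref - dev_ref1                      # 7.3 (7.2721 before rounding)
p <- pchisq(chisq, df = 7 - 5, lower.tail = FALSE)
round(p, 4)  # close to the reported 0.02636
```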
Fitted1 <- fitted(BPRS_ref1)
BPRS <- BPRS %>%
mutate(Fitted1)
#ggplot(BPRS, aes(x = week, y = Fitted1, group = interaction(treatment, subject))) +
#geom_line(aes(linetype = factor(subject))) +
#scale_x_continuous(name = "Week") +
#scale_y_continuous(name = "Fitted(bprs)") +
#theme(legend.position = "top")